Search | BVS Regional Portal

Evaluation of clustering and topic modeling methods over health-related tweets and emails.

Lossio-Ventura, Juan Antonio; Gonzales, Sergio; Morzan, Juandiego; Alatrista-Salas, Hugo; Hernandez-Boussard, Tina; Bian, Jiang.

Artif Intell Med ; 117: 102096, 2021 Jul.

Article in English | MEDLINE | ID: mdl-34127235

ABSTRACT

BACKGROUND: The Internet provides different tools for communicating with patients, such as social media (e.g., Twitter) and email platforms. These platforms provide new data sources that shed light on patient experiences with health care and improve our understanding of patient-provider communication. Several existing topic modeling and document clustering methods have been adapted to analyze these new free-text data automatically. However, both tweets and emails are often composed of short texts, and existing topic modeling and clustering approaches have suboptimal performance on these short texts. Moreover, research over health-related short texts using these methods has become difficult to reproduce and benchmark, partially due to the absence of a detailed comparison of state-of-the-art topic modeling and clustering methods on these short texts.

METHODS: We trained eight state-of-the-art topic modeling and clustering algorithms on short texts from two health-related datasets (tweets and emails): Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), LDA with Gibbs Sampling (GibbsLDA), Online LDA, Biterm Model (BTM), Online Twitter LDA, and Gibbs Sampling for Dirichlet Multinomial Mixture (GSDMM), as well as the k-means clustering algorithm with two different feature representations: TF-IDF and Doc2Vec. We used cluster validity indices to evaluate the performance of topic modeling and clustering: two internal indices (i.e., assessing the goodness of a clustering structure without external information) and five external indices (i.e., comparing the results of a cluster analysis to externally provided class labels).

RESULTS: Overall, for a number of clusters (k) from 2 to 50, Online Twitter LDA and GSDMM achieved the best performance in terms of internal indices, while LSI and k-means with TF-IDF had the highest external indices. Also, of all tweets (N = 286,971; HPV represents 94.6% of tweets and Lynch syndrome represents 5.4%), for k = 2, most of the methods could respect this initial clustering distribution. However, we found that model performance varies with the source of data and with hyper-parameters such as the number of topics and the number of iterations used to train the models. We also conducted an error analysis using the Hamming loss metric, for which the poorest value was obtained by GSDMM on both datasets.

CONCLUSIONS: Researchers hoping to group or classify health-related short-text data need to select the most suitable topic modeling and clustering methods for their specific research questions. Therefore, we presented a comparison of the most commonly used topic modeling and clustering algorithms over two health-related, short-text datasets using both internal and external clustering validation indices. Internal indices suggested Online Twitter LDA and GSDMM as the best, while external indices suggested LSI and k-means with TF-IDF as the best. In summary, our work suggests that researchers can improve their analysis of model performance by using a variety of metrics, since there is no single best metric.
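To illustrate what an external index and a Hamming-loss error analysis measure, here is a minimal Python sketch. It is not the authors' code. The abstract does not name its five external indices, so purity is used here only as a familiar example. The majority-vote mapping from cluster ids to class labels, the function names, and the ten toy labels are all assumptions made for illustration.

```python
from collections import Counter


def majority_mapping(true_labels, cluster_ids):
    """Map each cluster id to the most frequent true label among its members.

    Assumption: this majority vote is one simple way to align clusters
    with classes; the abstract does not say how alignment was done.
    """
    members = {}
    for label, cluster in zip(true_labels, cluster_ids):
        members.setdefault(cluster, Counter())[label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in members.items()}


def purity(true_labels, cluster_ids):
    """External index: fraction of documents matching their cluster's majority label."""
    mapping = majority_mapping(true_labels, cluster_ids)
    hits = sum(1 for y, c in zip(true_labels, cluster_ids) if mapping[c] == y)
    return hits / len(true_labels)


def hamming_loss(true_labels, cluster_ids):
    """Single-label Hamming loss: fraction of documents whose mapped label is wrong."""
    mapping = majority_mapping(true_labels, cluster_ids)
    misses = sum(1 for y, c in zip(true_labels, cluster_ids) if mapping[c] != y)
    return misses / len(true_labels)


# Toy data (invented): 8 "hpv" and 2 "lynch" documents split into k = 2 clusters.
toy_labels = ["hpv"] * 8 + ["lynch"] * 2
toy_clusters = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

toy_purity = purity(toy_labels, toy_clusters)
toy_loss = hamming_loss(toy_labels, toy_clusters)
```

In this single-label setting with a majority-vote mapping, the two quantities are complementary (loss = 1 - purity). Other mappings or multi-label settings would break that relation.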

Subjects

Electronic Mail , Social Media , Cluster Analysis , Communication , Humans , Machine Learning

Clinical named-entity recognition: A short comparison.

Lossio-Ventura, Juan Antonio; Boussard, Sebastien; Morzan, Juandiego; Hernandez-Boussard, Tina.

Proceedings (IEEE Int Conf Bioinformatics Biomed) ; 2019: 1548-1550, 2019 Nov.

Article in English | MEDLINE | ID: mdl-35463810

ABSTRACT

The adoption of electronic health records has increased the volume of clinical data, which has opened an opportunity for healthcare research. Several biomedical annotation systems have been used to facilitate the analysis of clinical data. However, there is a lack of comparisons of clinical annotation tools to help select the most suitable one for a specific clinical task. In this work, we used clinical notes from the MIMIC-III database and evaluated three annotation systems on identifying four types of entities: (1) procedure, (2) disorder, (3) drug, and (4) anatomy. Our preliminary results demonstrate that BioPortal performs well when extracting disorder and drug entities. This can provide clinical researchers with real clinical insights into patients' health patterns, and it may allow the creation of a first version of an annotated dataset.

Clustering and topic modeling over tweets: A comparison over a health dataset.

Lossio-Ventura, Juan Antonio; Morzan, Juandiego; Alatrista-Salas, Hugo; Hernandez-Boussard, Tina; Bian, Jiang.

Proceedings (IEEE Int Conf Bioinformatics Biomed) ; 2019: 1544-1547, 2019 Nov.

Article in English | MEDLINE | ID: mdl-35463811

ABSTRACT

Twitter has become the most popular form of social interaction in the healthcare domain. Thus, various teams have evaluated Twitter as an additional source where patients share information about their healthcare, with the potential goal of improving their outcomes. Several existing topic modeling and document clustering applications have been adapted to assess tweets, showing that the performance of these applications is negatively affected by the nature and characteristics of tweets. Moreover, Twitter health research has become difficult to measure because of the absence of comparisons between the existing applications. In this paper, we perform an evaluation based on internal indices of different topic modeling and document clustering applications over two health-related Twitter datasets. Our results show that Online Twitter LDA and Gibbs LDA achieve better performance for extracting topics and grouping tweets. We aim to provide health practitioners with this comparison so that they can select the most suitable application for their tasks.
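An internal index scores a clustering using only the data and the cluster assignments, with no reference labels. The abstract does not name the indices it used, so the Python sketch below uses the silhouette coefficient only as a well-known example. The function name, the Euclidean distance, and the four 2-D toy points are assumptions; in practice the vectors would be document features such as TF-IDF weights or topic distributions.

```python
import math


def silhouette_score(points, cluster_ids):
    """Mean silhouette coefficient over all points, using Euclidean distance.

    For each point: a = mean distance to the other members of its own cluster,
    b = smallest mean distance to the members of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters contribute 0.
    """
    clusters = {}
    for point, cluster in zip(points, cluster_ids):
        clusters.setdefault(cluster, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    total = 0.0
    for point, cluster in zip(points, cluster_ids):
        own = clusters[cluster]
        if len(own) == 1:
            continue  # singleton cluster: silhouette defined as 0
        # The point itself adds distance 0, so divide by the other members only.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != cluster
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy feature vectors (invented): two tight groups far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = silhouette_score(toy_points, [0, 0, 1, 1])  # groups kept together
bad_score = silhouette_score(toy_points, [0, 1, 0, 1])   # groups split apart
```

Values near 1 indicate compact, well-separated clusters, and negative values indicate points that sit closer to another cluster than to their own. Because no labels are involved, a method can score well here and still disagree with known classes.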
